Introduction

The film “The Wandering Earth”, which has attracted much attention recently, depicts an extreme climate on the verge of collapse after a drastic climate shift. Such a catastrophic scenario is not merely a fantasy, given the severe weather now common in many places. Climate change and potential global warming have become a defining issue of our time, and monitoring temperatures across a region is therefore increasingly important for responding to weather-related disasters.

The data files US_weather_2015 and weather_stations contain, respectively, temperatures recorded at stations across the United States and the stations’ locations (longitude, latitude and elevation) from January 1 to December 31, 2015, a period of 365 days. Since spatial statistics models generally consider only longitude and latitude, we extract the variable elevation (elev) as a separate covariate and keep a single temperature record per location (stn, lon and lat) at each time point. Furthermore, to facilitate analysis, we merge the two datasets on their common variable, the station number (stn). Finally, we take the temporal resolution to be one day, and accordingly divide the whole dataset into 365 sub-dataframes, one per day, denoted “day1”, “day2” and so on.
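As a minimal sketch of this preprocessing, the merge-and-split logic can be illustrated in base R on toy stand-ins for the two files (the station numbers and values here are hypothetical, for illustration only):

```r
## Toy stand-ins for US_weather_2015 and weather_stations (hypothetical values)
weather <- data.frame(stn  = c(1L, 1L, 2L, 2L),
                      year = 2015L, month = 1L, day = c(1L, 2L, 1L, 2L),
                      temp = c(47.7, 46.1, 38.7, 40.2))
stations <- data.frame(stn = c(1L, 2L),
                       lon = c(-124.85, -124.74), lat = c(42.75, 48.39))

## Merge on the common key stn, then split into one data frame per day
merged   <- merge(weather, stations, by = "stn")
day_list <- split(merged, merged$day)  # day_list[["1"]] plays the role of "day1"
nrow(day_list[["1"]])                  # 2 stations observed on day 1
```

The analysis below performs the same steps with dplyr on the full datasets.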

Exploratory data analysis

Data description

The following packages are used in this project.

Read the two data files US_weather_2015 and weather_stations.

load("weather_data.RData")
load("weather_stations.RData")

The dataframe weather_data has five variables: stn, year, month, day and temp.

head(weather_data)
##      stn year month day temp
## 1 916520 2015     1   1 83.0
## 2 916700 2015     1   1 86.9
## 3 788460 2015     1   1 81.8
## 4 916600 2015     1   1 84.3
## 5 965650 2015     1   1 79.0
## 6 406370 2015     1   1 47.6

The dataframe weather_stations has seven variables: stn, name, country, state, lat, lon and elev.

head(weather_stations)
##      stn                          name country state     lat     lon elev
## 1 423630 MISSISSIPPI CANYON OIL PLATFO      US    LA  28.160 -89.220   37
## 2 619760   SERGE-FROLOW (ILE TROMELIN)      US       -15.883  54.517   13
## 3 621010                   MOORED BUOY      US        50.600  -2.933 -999
## 4 621110                   MOORED BUOY      US        58.900  -0.200 -999
## 5 621130                   MOORED BUOY      US        58.400   0.300 -999
## 6 621140                   MOORED BUOY      US        58.700   1.100 -999

Merge weather_data and weather_stations by the common variable stn, and remove duplicate rows based on all variables.

library(dplyr)
library(ggplot2)
## Extract the states in the main US territory
statesMap = map_data("state")
states = unique(statesMap$region)
weather_stations_sub = filter(weather_stations, lat < 50 & lat > 25 & lon > (-125) & lon < (-65))
## Merge weather_data and the station locations by stn
weather_data2 = filter(weather_data, stn %in% weather_stations_sub$stn)
matchPos = match(weather_data2$stn, weather_stations_sub$stn)
weather_data2 = cbind(weather_data2, weather_stations_sub[matchPos, c("lon", "lat","elev")])

weather_data2 = na.omit(weather_data2)
## Remove Duplicate Rows based on all variables
weather_data2 = distinct(weather_data2)
## add a variable "num" to identify date
weather_data2 = mutate(weather_data2, num = as.numeric(ISOdate(2015,month,day)-ISOdate(2015,1,1),
                                                       units="days")+1)
weather_data2 = mutate(weather_data2, date = as.Date(paste(year,month,day,sep = "-")))

## Sort data by date, lon and lat
weather_data2 = arrange(weather_data2, num, lon, lat)
weather_data2 = mutate(weather_data2, lon_rank = min_rank(lon), lat_rank = min_rank(lat))
head(weather_data2)
##      stn year month day temp      lon    lat elev num       date lon_rank
## 1 997198 2015     1   1 47.7 -124.850 42.750  0.0   1 2015-01-01        1
## 2 994300 2015     1   1 38.7 -124.740 48.390 30.8   1 2015-01-01      665
## 3 997243 2015     1   1 39.4 -124.730 48.490  3.0   1 2015-01-01     1029
## 4 997696 2015     1   1 32.2 -124.630 47.920  3.0   1 2015-01-01     1393
## 5 727970 2015     1   1 30.5 -124.555 47.938 56.4   1 2015-01-01     1697
## 6 996760 2015     1   1 44.0 -124.530 44.620  0.0   1 2015-01-01     2356
##   lat_rank
## 1   753198
## 2   945565
## 3   948603
## 4   934316
## 5   934985
## 6   832722
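The variable num constructed above is simply the day of the year; the same index can be obtained with base Date arithmetic, which gives a quick sanity check of the ISOdate() computation:

```r
## Day-of-year index, equivalent to the ISOdate() difference used above
day_num <- function(month, day, year = 2015) {
  as.numeric(as.Date(sprintf("%d-%02d-%02d", year, month, day)) -
             as.Date(sprintf("%d-01-01", year))) + 1
}
day_num(1, 1)    # 1
day_num(12, 31)  # 365 (2015 is not a leap year)
```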

Monthly data

First, we compute the mean temperature by month, so that the histogram and boxplot give a rough picture of the monthly temperature in the United States.

library(plotly)
## Mean vs. Month:
month_mean = weather_data2 %>%
  group_by(month) %>%
  summarise(temp_month = mean(temp, na.rm=TRUE))
## Histogram of average monthly temperature
fit <- fitted(loess(month_mean$temp_month ~ month_mean$month))
p_hist <- plot_ly(data = month_mean, x = ~month, y = ~temp_month, type = "bar", showlegend=FALSE,
                  marker=list(color=~month, showscale=FALSE)) %>%
  add_lines(y = fit, showlegend=FALSE, line = list(color = 'black')) %>%
  layout(title = 'Average monthly temperature in the US in 2015',
         showlegend=FALSE, xaxis = list(side="right", showgrid=FALSE),
         yaxis=list(title = "temp (Fahrenheit)",showgrid=FALSE))
p_hist
## A marker object has been specified, but markers is not in the mode
## Adding markers to the mode...
## Boxplot of US monthly temperature
p_box <- ggplot(weather_data2, aes(month, temp, group = month, color = month, fill = month)) + 
  geom_boxplot(outlier.shape = NA) + 
  scale_fill_gradientn(colours = terrain.colors((12))) + 
  theme(legend.position='none') +
  ggtitle("Boxplot of US monthly temperature")

p_box

The histogram of monthly averages shows that the mean temperature first rises, peaks in July at approximately 75 Fahrenheit, and then falls gradually as winter approaches after October. The boxplot displays the distribution of temperatures across stations within each month. Consistent with the histogram, the median temperature peaks in July; moreover, the July data are concentrated in the narrowest range of all months. In addition, temperatures across locations vary widely in February, which suggests that in the coldest season temperature is strongly influenced by the longitude and latitude coordinates, whereas in the hot seasons the coordinates have no comparably strong effect. Both plots reflect the trend of temperature over time.

Data of Jan 1, 2015

To familiarize ourselves with the geography of the dataset, we will initially ignore the temporal component and examine the spatial distribution of temperatures on a single day. Taking day 1 as an example, we therefore extract the data of January 1, 2015.

## Get the data on Jan 1st, 2015
day1 = filter(weather_data2, num==1) 

Temperature distribution image of Jan 1, 2015

Since the observed temperatures are discrete points relative to the whole space (in the longitude-latitude grid, only 0.02% of the entries are not NA), the United States temperatures cannot be displayed with the image.plot command in the fields package. We therefore use the following method to plot the temperature distribution of the discrete points.
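The sparsity claim can be checked by scattering stations into a full longitude × latitude grid and computing the share of non-NA cells; a toy sketch with synthetic coordinates (not the real station locations):

```r
## Fraction of non-NA cells when n stations are placed in a full lon x lat grid
set.seed(1)
n   <- 2000
lon <- round(runif(n, -125, -65), 2)
lat <- round(runif(n, 25, 50), 2)
ulon <- sort(unique(lon)); ulat <- sort(unique(lat))
grid <- matrix(NA_real_, length(ulon), length(ulat))
grid[cbind(match(lon, ulon), match(lat, ulat))] <- 1
mean(!is.na(grid))  # a tiny fraction, the same situation as the 0.02% quoted above
```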

day1_= day1
day1_$hover <- with(day1_, paste("<br>", "Lon:", lon,  
                                 "<br>","Lat:", lat, 
                                 "<br>","Elev:", elev,
                                 "<br>", "Temp:", temp))
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white'))

library(plotly)
p1 <- plot_geo(day1_, lat = ~lat, lon = ~lon) %>%
  add_markers(
    text = ~hover, showlegend=FALSE,
    marker=list(color = ~temp, showscale=FALSE),
    hoverinfo = "text") %>%
  layout(title = 'US temp (Fahrenheit) on Jan 1st, 2015',
         geo = g, showlegend=T)
p1

From the temperature distribution of the discrete points, it appears that deep winter gripped large parts of the United States on January 1st, 2015. The northern and northwestern parts were very cold, especially New England, the Midwest and the Mid-Atlantic states. Although temperatures in the Southeast and Southwest were typically milder than in the North and Midwest, the weather there was still cold. A pronounced temperature gradient is visible, from highs of over 85.6 Fahrenheit in the south of the study area, which lies nearer the equator, to a low of -14.7 Fahrenheit towards the northern boundary. This is not only indicative of spatial correlation in the dataset; it also shows that the data are not stationary, as the mean temperature varies strongly with latitude. With the exception of a few regions along the western coastline of the United States, temperature decreased with increasing latitude, and there may be a nonlinear relationship between temperature and longitude. Apart from the coordinates, temperature is usually connected to elevation, with stations at higher elevations being colder. Next, to explore how various factors affect the temperature trend, we plot temperature against longitude, latitude and elevation respectively.

Mean and variance of US temperature by latitude on Jan 1 in 2015

library(gridExtra)
## Mean vs. Latitude:
lat_mean=day1 %>%
  group_by(lat)%>%
  summarise(temp_lat = mean(temp, na.rm=TRUE)) 

f1 = ggplot(data = lat_mean) + 
  geom_point(mapping = aes(x = lat, y = temp_lat))+
  geom_smooth(mapping = aes(x = lat, y = temp_lat)) +
  labs(x = "lat",y = "temp", title = 'Mean temp by Latitude')

## Variance vs. Latitude:
lat_var=day1 %>%
  group_by(lat)%>%
  summarise(temp_var_lat = var(temp, na.rm=TRUE))
f2 = ggplot(data = lat_var) + 
  geom_point(mapping = aes(x = lat, y = temp_var_lat))+
  geom_smooth(mapping = aes(x = lat, y = temp_var_lat))+
  labs(x = "lat",y = "variance of temp", title = 'Variance in temp by Latitude')
grid.arrange(f1, f2, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Mean and variance of US temperature by longitude on Jan 1 in 2015

## Mean vs. Longitude:
lon_mean=day1 %>%
  group_by(lon)%>%
  summarise(temp_lon = mean(temp, na.rm=TRUE)) 
f3 = ggplot(data = lon_mean) + 
  geom_point(mapping = aes(x = lon, y = temp_lon))+
  geom_smooth(mapping = aes(x = lon, y = temp_lon))+
  labs(x = "lon",y = "temp", title = 'Mean temp by Longitude')

## Variance vs. Longitude:
lon_var=day1 %>%
  group_by(lon)%>%
  summarise(temp_var_lon = var(temp, na.rm=TRUE))
f4 = ggplot(data = lon_var) + 
  geom_point(mapping = aes(x = lon, y = temp_var_lon))+
  geom_smooth(mapping = aes(x = lon, y = temp_var_lon))+
  labs(x = "lon",y = "variance of temp", title = 'Variance in temp by Longitude')
grid.arrange(f3, f4, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Mean and variance of US temperature by elevation on Jan 1 in 2015

## Mean vs. Elevation:
elev_mean=day1 %>%
  group_by(elev)%>%
  summarise(temp_elev = mean(temp, na.rm=TRUE)) 
f5 = ggplot(data = elev_mean) + 
  geom_point(mapping = aes(x = elev, y = temp_elev))+
  geom_smooth(mapping = aes(x = elev, y = temp_elev))+
  labs(x = "elev",y = "temp", title = 'Mean temp by Elevation')

## Variance vs. Elevation:
elev_var=day1 %>%
  group_by(elev)%>%
  summarise(temp_var_elev = var(temp, na.rm=TRUE))
f6 = ggplot(data = elev_var) + 
  geom_point(mapping = aes(x = elev, y = temp_var_elev))+
  geom_smooth(mapping = aes(x = elev, y = temp_var_elev))+
  labs(x = "elev",y = "variance of temp", title = 'Variance in temp by Elevation')
grid.arrange(f5, f6, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

These three figures illustrate the non-stationarity of the temperature, so it is necessary to characterize the temperature trend before selecting a feasible covariance model. As latitude moves away from the equator, the mean temperature falls. The fitted mean and variance curves across longitudes further demonstrate the nonlinear relationship between temperature and longitude. Possibly affected by a small number of extreme values, the variance at certain locations is extraordinarily large. Since the figures for a single day may not provide sufficient information about the trend, we draw the corresponding fitted mean (variance) curves of temperature against the explanatory variables (longitude and latitude) for each of the 12 months (see the animation). The figures for each month display a similar pattern: temperature decreases linearly with latitude, while the relationship between temperature and longitude is nonlinear. It therefore makes sense to use the same form of mean function to predict the temperature over the grid on each day, based on the observations.

Research objectives

In light of the images of temperature changing with longitude, latitude, elevation and month, we can confirm that there are marginal trends in both location and time. For simplicity, in the variogram modeling and kriging throughout this paper, we treat the latitude and longitude coordinates as if they were Cartesian. In geostatistics, we generally assume the temperature is normally distributed, with mean and covariance structures that are functions of latitude, longitude and time. The mean function may be a linear combination or some other nonlinear form of the three factors, and the covariance may be taken from the \(Mat\acute{e}rn\) class, which is popular in spatial statistics. We will determine the structure of this normal distribution in the following steps:

1. Conduct a purely spatial analysis, ignoring the influence of time, and find the most appropriate model.

2. Conduct a spatial-temporal analysis.

3. Use various forms of kriging to fill gaps, so that we obtain the complete temperature variation diagram across the United States.

Spatial analysis of the temperature

The spatial model

In the function “spatialProcess” of the package “fields”, it assumes a spatial model \[Y_k=P(x_k)+Z(x_k)\beta+g(x_k)+e_k\] The estimated surface is the best linear unbiased estimate (BLUE) of the first three terms given observed data.

  • Notation

    • \(Y_k\): dependent variable (i.e. temperature) at location \(x_k\).

    • \(P(\cdot)\): a low degree polynomial (the default is a linear function; the order of the polynomial can be adjusted via mKrig.args).

    • \(Z(\cdot)\): a matrix of covariates in a linear model (the elevation can be put in this part).

    • \(g(\cdot)\): a mean zero Gaussian process with covariance function \(K\) defined by the marginal process variance \(\rho\) and the range parameter \(\theta\), which means \(K(\rho,\theta) = \rho\cdot \text{Cov}(\theta)\).

    • \(e_k\): independent normal error, \(e_k \sim N(0,\sigma^2)\), where \(\sigma^2\) is called the nugget variance.
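The Matérn covariance with smoothness \(\nu = 1\) used below can be written down explicitly; a minimal base-R sketch of \(K(\rho,\theta)\), with \(\rho\) and \(\theta\) set near the Model 1 MLEs purely for illustration:

```r
## Matern covariance: Cov(d) = rho * 2^(1-nu)/gamma(nu) * (d/theta)^nu * K_nu(d/theta)
## with Cov(0) = rho; besselK is the modified Bessel function of the second kind
matern_cov <- function(d, rho, theta, nu = 1) {
  out <- rep(rho, length(d))
  pos <- d > 0
  out[pos] <- rho * 2^(1 - nu) / gamma(nu) *
    (d[pos] / theta)^nu * besselK(d[pos] / theta, nu)
  out
}
matern_cov(0,   rho = 60.86, theta = 157.1)  # process variance at distance 0
matern_cov(500, rho = 60.86, theta = 157.1)  # decays with distance (miles)
```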

Define train data and test data

Longitude, latitude, elevation and the corresponding temperature are extracted from the data “day1”; 80% of the observations form the training set and the rest the test set.

X_new = day1[, c("lon", "lat")]
Y_new = day1["temp"]
Z_new = day1[, "elev"]
set.seed(123)
index = sample(nrow(Y_new),nrow(Y_new)*0.8)
train_X = X_new[index,]
test_X = X_new[-index,]
train_y = as.matrix(Y_new)[index]
test_y = as.matrix(Y_new)[-index]
train_Z = Z_new[index]
test_Z = Z_new[-index]

Model fitting

Model 1: Assume quadratic polynomial trend in P

library(fields)
par_est = spatialProcess(train_X,train_y,Z=train_Z,
                         mKrig.args = list(m = 3),
                         Distance = "rdist.earth",
                         cov.args = list(Covariance = "Matern",
                                         smoothness = 1))
# where m=3 means quadratic form
print(par_est)
## Call:
## spatialProcess(x = train_X, y = train_y, Z = train_Z, mKrig.args = list(m = 3), 
##     cov.args = list(Covariance = "Matern", smoothness = 1), Distance = "rdist.earth")
##                                                          
##  Number of Observations:                          1850   
##  Degree of polynomial in fixed part:              2      
##  Total number of parameters in fixed part:        7      
##  Number of additional covariates (Z)              1      
##  MLE nugget variance ( sigma^2)                   4.565  
##  MLE process variance (rho)                       60.86  
##  MLE range parameter (theta, units of distance):  157.1  
##  Approx 95% lower bound:                          131    
##             upper bound:                          207.8  
##   Approx.  degrees of freedom for curve           609.2  
##     Standard Error of df estimate:                7.066  
##  Nonzero entries in covariance                    3422500
##  
##  Covariance Model: stationary.cov
##    Covariance function:   Matern
##    Non-default covariance arguments and their values 
##    Argument: Covariance  has the value(s): 
## [1] "Matern"
##    Argument: smoothness  has the value(s): 
## [1] 1
##    Argument: Distance  has the value(s): 
## [1] "rdist.earth"
##    Argument: theta  has the value(s): 
##    theta 
## 157.1435 
##    Argument: onlyUpper  has the value(s): 
## [1] FALSE
##    Argument: distMat  has the value(s): 
## [1] NA
sigma.square = as.numeric(par_est$sigma.MLE)^2  # nugget variance (sigma^2)
rho = as.numeric(par_est$rho.MLE)               # process variance (rho) 
theta = as.numeric(par_est$theta.MLE)        # range parameter(theta) (miles)

Model 2: Assume linear polynomial trend in P

par_est2 = spatialProcess(train_X,train_y,Z=train_Z,
                         mKrig.args = list(m = 2),
                         Distance = "rdist.earth",
                         cov.args = list(Covariance = "Matern",
                                         smoothness = 1))
# where m=2 means linear form
print(par_est2)
## Call:
## spatialProcess(x = train_X, y = train_y, Z = train_Z, mKrig.args = list(m = 2), 
##     cov.args = list(Covariance = "Matern", smoothness = 1), Distance = "rdist.earth")
##                                                          
##  Number of Observations:                          1850   
##  Degree of polynomial in fixed part:              1      
##  Total number of parameters in fixed part:        4      
##  Number of additional covariates (Z)              1      
##  MLE nugget variance ( sigma^2)                   4.733  
##  MLE process variance (rho)                       139.8  
##  MLE range parameter (theta, units of distance):  257.7  
##  Approx 95% lower bound:                          198.8  
##             upper bound:                          384    
##   Approx.  degrees of freedom for curve           571.9  
##     Standard Error of df estimate:                6.899  
##  Nonzero entries in covariance                    3422500
##  
##  Covariance Model: stationary.cov
##    Covariance function:   Matern
##    Non-default covariance arguments and their values 
##    Argument: Covariance  has the value(s): 
## [1] "Matern"
##    Argument: smoothness  has the value(s): 
## [1] 1
##    Argument: Distance  has the value(s): 
## [1] "rdist.earth"
##    Argument: theta  has the value(s): 
##    theta 
## 257.6591 
##    Argument: onlyUpper  has the value(s): 
## [1] FALSE
##    Argument: distMat  has the value(s): 
## [1] NA
sigma.square2 = as.numeric(par_est2$sigma.MLE)^2  # nugget variance (sigma^2)
rho2 = as.numeric(par_est2$rho.MLE)               # process variance (rho) 
theta2 = as.numeric(par_est2$theta.MLE)           # range parameter (theta) (miles)

Model evaluation

## The maximum distance among all pairs 
out=rdist.earth(cbind(train_X$lon,train_X$lat),miles = T)
range(out)[2]   
## [1] 3167.165

Given the maximum distance among all pairs of training stations, the estimates “par_est” and “par_est2” produced by the two models both appear feasible.
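rdist.earth measures great-circle distance, so the order of magnitude is easy to verify with a base-R haversine formula; the diagonal of the bounding box used earlier, from (-125, 25) to (-65, 50), gives an upper bound (R = 3963.34 is an earth radius in miles close to the one used by rdist.earth):

```r
## Great-circle distance in miles via the haversine formula
haversine_miles <- function(lon1, lat1, lon2, lat2, R = 3963.34) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))
}
haversine_miles(-125, 25, -65, 50)  # about 3600 miles, consistent with the
                                    # 3167-mile maximum among station pairs
```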

  • d: the coefficients of the polynomial fixed part and of the covariate Z; the leading numbers are the polynomial coefficients, while the last is the coefficient of the extra covariate Z.

  • eff.df: an estimate of the effective degrees of freedom, which has a one-to-one mapping with \(\lambda=\sigma^2/\rho\). A very small eff.df suggests that the surface is well represented by a low order polynomial; conversely, if eff.df approaches the number of training observations, the surface is close to interpolating the observations, which suggests a small or zero value for the nugget variance.

  • lnProfileLike: log Profile likelihood for lambda.

  • RMSE: the relative mean squared prediction error on the test set.
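Since eff.df maps one-to-one to \(\lambda=\sigma^2/\rho\) (for a fixed covariance), the two fits can also be compared on \(\lambda\) directly; the numbers below are simply copied from the printed summaries:

```r
## lambda = sigma^2 / rho from the MLEs reported in the two model summaries
lambda1 <- 4.565 / 60.86   # Model 1 (quadratic trend)
lambda2 <- 4.733 / 139.8   # Model 2 (linear trend)
c(lambda1 = lambda1, lambda2 = lambda2)
```

Note that the \(\lambda\)-to-eff.df mapping also depends on the estimated range \(\theta\), which differs between the two fits, so this comparison is only indicative.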

As for the quadratic polynomial trend

par_est$d
##              [,1]
## [1,] 566.58554323
## [2,]   6.22423087
## [3,] -10.68986435
## [4,]   0.02730697
## [5,]  -0.02599832
## [6,]   0.07907762
## [7,]  -0.00386215

The coefficient of Z is so small that it may be ignored.

par_est$eff.df
## [1] 609.1969
par_est$lnProfileLike
## [1] -4691.364
pre = predict(par_est, xnew = test_X, Z = test_Z)
RMSE = mean((pre-test_y)^2)/median(test_Z)
RMSE
## [1] 0.03354896
set.panel(2,2)
## plot window will lay out plots in a 2 by 2 matrix
plot(par_est)  

As for the linear polynomial trend

par_est2$d
##               [,1]
## [1,] 105.311988795
## [2,]  -0.154836179
## [3,]  -2.111269961
## [4,]  -0.003804903

The coefficient of Z is so small that it may be ignored.

par_est2$eff.df
## [1] 571.8559
par_est2$lnProfileLike
## [1] -4702.465
pre2 = predict(par_est2, xnew = test_X, Z = test_Z)
RMSE2 = mean((pre2-test_y)^2)/median(test_Z)
RMSE2
## [1] 0.0332631
set.panel(2,2)
## plot window will lay out plots in a 2 by 2 matrix
plot(par_est2)  

From the above output, the log profile likelihood for lambda of model 1 is larger than that of model 2, which indicates a better fit for the former, while the RMSEs of the two models are nearly identical (model 2 is marginally lower). In view of goodness of fit, the small RMSE and interpretability, we choose model 1. The covariance model could also be selected, but since the candidate models show no obvious difference in prediction, the \(Mat\acute{e}rn\) model with flexible parameters is used to fit the whole dataset.

Redefine the train data

train_X = X_new
train_y = Y_new
train_Z = Z_new
## parameter estimation
# (fields uses great-circle distances, approximating the earth by a sphere)
par_est_whole = spatialProcess(train_X,train_y,Z=train_Z,
                               mKrig.args = list(m = 3),
                               Distance = "rdist.earth",
                               cov.args = list(Covariance = "Matern",smoothness = 1))
# where m=3 means quadratic form
print(par_est_whole)
sigma.square = as.numeric(par_est_whole$sigma.MLE)^2     # nugget variance (sigma^2)  4.225
rho = as.numeric(par_est_whole$rho.MLE)                  # process variance (rho)   60.12
theta = as.numeric(par_est_whole$theta.MLE)              # range parameter (theta)  150.646(miles) 

Construct the prediction region

## Create the prediction grid: resolution of (60/50)° x (25/50)° in longitude and latitude
lon_num = 50
lat_num = 50
lon_seq = seq(-125,-65,length.out=lon_num)
lat_seq = seq(25,50,length.out=lat_num)

## Restrict the grid to the USA mainland
require(mapdata) 
tmp <- map('worldHires', 'Usa', fill=TRUE, plot=FALSE) 
require(maptools) 
US.boundary <- map2SpatialPolygons(tmp, IDs = tmp$names, proj4string = CRS(as.character(NA))) 
US.bbox.grid <- SpatialPoints(cbind(rep(lon_seq,length(lat_seq)), 
                                    rep(lat_seq,each=length(lon_seq)))) 
gridded(US.bbox.grid) <- TRUE
US.grid <- US.bbox.grid[!is.na(over(US.bbox.grid, US.boundary)),] 
Z = as.matrix(rep(mean(Z_new),length(US.grid@coords[,1])))
pre_whole = predict(par_est_whole, xnew = US.grid@coords,Z = Z) # Z is not necessary

Plot of krigged temperature

df=as.data.frame(US.grid)
df$z <- pre_whole
colnames(df)=c("lon","lat","temp")

library(ggplot2)
statesMap <- map_data("state")

library(viridis)
ggplot(df, aes(lon, lat,fill = temp)) + 
  geom_raster(interpolate = F) + 
  scale_fill_continuous(type="viridis",alpha=0.9) +
  geom_polygon(data = statesMap, aes(x=long, y = lat, group = group), color = "black",fill="grey", alpha=0) +
  labs(x = "lon",y = "lat", title = 'Krigged US temp (Fahrenheit) on Jan 1st, 2015') + 
  theme(plot.title = element_text(hjust = 0.4))

Spatial pattern of the residuals after fitting

library(plotly)
testdata = cbind(X_new,Z_new,par_est_whole$residuals)
colnames(testdata)=c("lon","lat", "elev", "temp")

testdata$hover <- with(testdata, paste("<br>", "Lon:", lon,  
                                       "<br>","Lat:", lat, 
                                       "<br>","Elev:", elev,
                                       "<br>", "ResTemp:", temp))
bubble = plot_ly(testdata, x = ~lon, y = ~lat) %>%
  add_markers(
    text = ~hover, showlegend=FALSE,
    marker=list(color = ~temp, 
                size = ~2*abs(temp), opacity = 0.5, 
                type = 'scatter', mode = 'markers', showscale=FALSE),
    hoverinfo = "text") %>%
  layout(title = 'Residual from two order polynomial trend with Matern class covariance',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))
bubble
range(par_est_whole$residuals)
## [1] -13.14837  18.24184

The range of the fitted residuals is \((-13.14837, 18.24184)\), and the bubble diagram reflects the overall fit of the spatial model with the quadratic polynomial trend and \(Mat\acute{e}rn\) covariance; only a few locations show a large deviation in the predicted value, which could be a result of high elevation.

Spatial-temporal analysis of the temperature

Although the model above is adept at extracting spatial information, temperature is also affected by the seasons, so it makes sense to take time into account. Furthermore, since the temperature data do not follow a normal distribution directly, we should remove the spatial trend of the temperature rather than use the temperature itself. The following code simultaneously computes the residuals and the temperature predictions at the gridded points for each day during the period of the first 10 days of 2015.

Construct spatial(temporal) object and class dataframe

library(sp)
library(dplyr)
## Restrict the merged data to the first 10 days of 2015
weather_data3 <- filter(weather_data2, num <= 10)
## Unique longitude and latitude coordinates of all data
lon.lat.1to10 <- weather_data3[,c(1,6,7)]
lon.lat.1to10.uni <- distinct(lon.lat.1to10,lon,lat,.keep_all = T)
### spatial object
sp.1to10 = lon.lat.1to10.uni[,c(2,3)]
row.names(sp.1to10) = lon.lat.1to10.uni$stn
coordinates(sp.1to10)=~lon+lat
sp.1to10 = SpatialPoints(sp.1to10,
                         proj4string=CRS("+init=epsg:4326 +proj=longlat +ellps=WGS84 
                                         +datum=WGS84 +no_defs +towgs84=0,0,0"))
### temporal object
time = as.Date("2015-01-01")+0:9
### object of class data.frame
temp.1to10 <- NULL
for(i in 1:10){
  day <- filter(weather_data3, num==i)
  dat <- day[,c(8,9,6,7,5)]
  out <- distinct(dat,lon,lat,.keep_all = T)
  tmp = left_join(lon.lat.1to10.uni,out,by = c("lon","lat"))
  tmp = arrange(tmp,lon,lat)
  temp.1to10 = rbind(temp.1to10, tmp)
  
}
temp.1to10 = temp.1to10[,c(1,4,5,2,3,6)]
temp.data = data.frame(temp.1to10$temp)
colnames(temp.data)="temp"

Generate STFDF object

The STFDF object is constructed to facilitate further spatial-temporal analysis. Since allocating the full space-time grid in R would require a large amount of memory, and the time dependence of temperature is short, we only transform the first 10 days of data into an STFDF object (a class for spatial-temporal data with a full space-time grid).

## generate STFDF
library(spacetime)
sp.1to10.time = STFDF(sp.1to10, time, temp.data, endTime = delta(time))
na.stations <- which(apply(as(sp.1to10.time, "xts"), 2, function(x) sum(is.na(x))>0))
sp.1to10.time = sp.1to10.time[-na.stations,]
load("sp.1to10.time.RData")
## summarize the structure of sp.1to10.time
summary(sp.1to10.time)
## Object of class STFDF
##  with Dimensions (s, t, attr): (2263, 10, 1)
## [[Spatial:]]
## Object of class SpatialPoints
## Coordinates:
##         min     max
## lon -124.85 -65.930
## lat   25.01  49.318
## Is projected: FALSE 
## proj4string :
## [+init=epsg:4326 +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs
## +towgs84=0,0,0]
## Number of points: 2263
## [[Temporal:]]
##      Index              timeIndex    
##  Min.   :2015-01-01   Min.   : 1.00  
##  1st Qu.:2015-01-03   1st Qu.: 3.25  
##  Median :2015-01-05   Median : 5.50  
##  Mean   :2015-01-05   Mean   : 5.50  
##  3rd Qu.:2015-01-07   3rd Qu.: 7.75  
##  Max.   :2015-01-10   Max.   :10.00  
## [[Data attributes:]]
##       temp       
##  Min.   :-21.70  
##  1st Qu.: 17.10  
##  Median : 30.80  
##  Mean   : 30.52  
##  3rd Qu.: 43.60  
##  Max.   : 79.90

Calculate spatial-temporal sample variogram

## Investigating spatio-temporal structure
vst <- variogramST(temp ~ polym(lon, lat, degree=2, raw=TRUE), sp.1to10.time)
vst2 <- variogramST(temp ~ polym(lon, lat, degree=2, raw=TRUE), sp.1to10.time, tlags=0:4)
vst3 <- variogramST(temp ~ polym(lon, lat, degree=3, raw=TRUE), sp.1to10.time, tlags=0:4)
vst4 <- variogramST(temp ~ polym(lon, lat, degree=1, raw=TRUE), sp.1to10.time, tlags=0:4)
head(vst)
##       np      dist    gamma   id timelag  spacelag   avgDist
## 1      0        NA       NA lag0       0   0.00000   0.00000
## 2 211400  80.79321 216.4033 lag0       0  63.43853  80.79321
## 3 481010 194.86145 246.6762 lag0       0 190.31560 194.86145
## 4 670610 319.89921 267.0799 lag0       0 317.19267 319.89921
## 5 827210 446.08290 280.6465 lag0       0 444.06974 446.08290
## 6 959500 572.40847 288.6045 lag0       0 570.94680 572.40847
summary(vst)
##        np               dist            gamma             id           
##  Min.   :      0   Min.   :   0.0   Min.   : 39.94   Length:160        
##  1st Qu.: 397976   1st Qu.: 446.1   1st Qu.:282.26   Class :character  
##  Median : 951718   Median : 952.3   Median :328.38   Mode  :character  
##  Mean   :1010450   Mean   : 899.5   Mean   :311.78                     
##  3rd Qu.:1460308   3rd Qu.:1395.7   3rd Qu.:359.30                     
##  Max.   :2543130   Max.   :1838.7   Max.   :394.89                     
##                    NA's   :1        NA's   :1                          
##     timelag       spacelag         avgDist      
##  Min.   :0.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:2.0   1st Qu.: 412.4   1st Qu.: 414.5  
##  Median :4.5   Median : 888.1   Median : 888.9  
##  Mean   :4.5   Mean   : 892.1   Mean   : 893.9  
##  3rd Qu.:7.0   3rd Qu.:1363.9   3rd Qu.:1364.0  
##  Max.   :9.0   Max.   :1839.7   Max.   :1838.7  
## 
  • Notation

    • timelag: the time lag used

    • spacelag: the midpoint in the spatial lag intervals as passed by the parameter boundaries

    • avgDist: the average distance between the point pairs found in a distance interval over all temporal lags (i.e. the averages of the values dist per temporal lag).

vstp1 = plot(vst, xlab="separation (km)", ylab="separation (+days)", main="(a) Semivariance, degree=2, Temp")
vstp2 = plot(vst2, xlab="separation (km)", ylab="separation (+days)", main="(b) Semivariance, degree=2, Temp")
vstp3 = plot(vst3, xlab="separation (km)", ylab="separation (+days)", main="(c) Semivariance, degree=3, Temp")
vstp4 = plot(vst4, xlab="separation (km)", ylab="separation (+days)", main="(d) Semivariance, degree=1, Temp")
grid.arrange(vstp1, vstp2, vstp3, vstp4, ncol=2)

vstl1 = plot(vst, map = FALSE, xlab="separation (km)", ylab = "Semivariance, Temp", main="(a) degree=2")
vstl2 = plot(vst2, map = FALSE, xlab="separation (km)", ylab = "Semivariance, Temp", main="(b) degree=2")
vstl3 = plot(vst3, map = FALSE, xlab="separation (km)", ylab = "Semivariance, Temp", main="(c) degree=3")
vstl4 = plot(vst4, map = FALSE, xlab="separation (km)", ylab = "Semivariance, Temp", main="(d) degree=1")
grid.arrange(vstl1, vstl2, vstl3, vstl4, ncol=2)

Since we have only 10 days of data, model vst, which uses ten time lags, seems inappropriate (see figure (a)). Models vst2 and vst3 behave better, as their semivariograms rapidly converge to the maximum. Note that these figures show the semivariogram of the residuals after the trend (specified by “formula”) has been removed. Consequently, beyond a fairly short spatial interval the semivariogram is large and the covariance is essentially zero, indicating that points separated by more than a certain spatial lag are uncorrelated and that the spatial trend has been extracted well. For interpretability, we choose model vst2.
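For reference, the empirical semivariance underlying these plots is, for each lag class \(h\), \(\gamma(h)=\frac{1}{2|N(h)|}\sum_{(i,j)\in N(h)}(z_i-z_j)^2\). A minimal base-R sketch with toy point pairs (illustrative values only):

```r
# Empirical semivariance for one lag class (toy data):
# gamma(h) = 1 / (2 * N(h)) * sum over pairs of (z_i - z_j)^2
z.pairs <- rbind(c(10, 12), c(11, 15), c(9, 10))  # 3 point pairs at lag h
gamma.h <- mean((z.pairs[, 1] - z.pairs[, 2])^2) / 2
gamma.h
## [1] 3.5
```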

Construct a spatio-temporal variogram of a given type

We consider the separable covariance model and fit a variogram to model vst2. The separable model has distinct spatial and temporal structures that interact only multiplicatively, so they can be fitted independently but share a common sill. The covariance structure is \[C(h,u)=C_s(h)\cdot C_t(u)\] where the subscripts \(s\) and \(t\) denote space and time, and \(h\) and \(u\) are the spatial and temporal lags, respectively.
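The multiplicative structure can be sketched in base R (sill and ranges below are illustrative, not the fitted values). The defining property is that \(C(h,u)\,C(0,0)=C(h,0)\,C(0,u)\), which we verify directly:

```r
# Separable space-time covariance: C(h, u) = sill * rho_s(h) * rho_t(u)
# with exponential correlation in both space and time (illustrative parameters)
rho.exp <- function(lag, range) exp(-lag / range)

C.sep <- function(h, u, sill = 340, range.s = 210, range.t = 7.5) {
  sill * rho.exp(h, range.s) * rho.exp(u, range.t)
}

# Separability check: C(h,u) * C(0,0) == C(h,0) * C(0,u)
h <- 150; u <- 3
all.equal(C.sep(h, u) * C.sep(0, 0), C.sep(h, 0) * C.sep(0, u))
## [1] TRUE
```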

(estimated.sill <- quantile(na.omit(vst2$gamma), .8))
(vgm.sep <- vgmST(stModel="separable", space=vgm(0.9,"Exp", 300,0.1), 
                  time=vgm(0.95,"Exp", 2, 0.05), sill=estimated.sill))
vgmf.sep <- fit.StVariogram(vst2, vgm.sep, method="L-BFGS-B", 
                            lower=c(100,0.001,1,0.001,40), 
                            control=list(maxit=500)) 
load("vgmf.sep.RData")  # load a precomputed fit (the optimization above is slow)
attr(vgmf.sep, "optim.output")$par
##     range.s    nugget.s     range.t    nugget.t        sill 
## 210.0587236   0.6172928   7.5209294   0.0010000 340.1618029

The result means that points separated by more than the fitted spatial range of about 210 km are only weakly correlated, and temperatures are hardly influenced by observations more than about 7 days apart. The nugget effects in both space and time are small, which yields a smoother temperature surface.
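Note that for the exponential model used here, correlation decays continuously rather than cutting off at the range: at the fitted range parameter it has dropped to \(e^{-1}\approx 37\%\), and it becomes negligible (below 5%) at roughly three times the range, the so-called practical range. A quick base-R check, taking the spatial range of about 210 km from the fit above:

```r
# Exponential correlation at multiples of the range parameter
rho <- function(h, range) exp(-h / range)
range.s <- 210               # km, fitted spatial range (from optim.output above)
rho(range.s, range.s)        # exp(-1) ~ 0.37 at the range parameter
rho(3 * range.s, range.s)    # exp(-3) ~ 0.05 at the practical range (~630 km)
```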

attr(vgmf.sep, "optim.output")$value 
## [1] 823.1293

The goodness-of-fit is expressed by the error sum of squares in the value attribute of the fitted model.
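Concretely, fit.StVariogram minimizes a (possibly weighted, via its fit.method argument) sum of squared differences between the empirical semivariances and the model's predictions at the same lags. A toy unweighted version of that criterion:

```r
# Toy illustration of the fit criterion: sum of squared errors between
# empirical semivariances and model semivariances at the same lags
# (fit.StVariogram additionally supports weighting schemes via fit.method)
gamma.emp   <- c(120, 250, 310, 335)   # toy empirical semivariances
gamma.model <- c(110, 260, 300, 340)   # toy model semivariances
sse <- sum((gamma.emp - gamma.model)^2)
sse
## [1] 325
```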

plot(vst, vgmf.sep) 

plot(vst, vgmf.sep, map=FALSE)

colnames(US.grid@coords) = c("lon","lat")
US.grid <- STF(sp=as(US.grid,"SpatialPoints"), time=sp.1to10.time@time)
proj4string(US.grid)=proj4string(sp.1to10.time)

k.de.sep <- krigeST(temp~polym(lon, lat, degree=2, raw=TRUE),
                    data=sp.1to10.time, newdata=US.grid, modelList=vgmf.sep) 
gridded(k.de.sep@sp) <- TRUE
load("k.de.sep.RData")  # load a precomputed kriging result
plot.zlim <- seq(floor(min(k.de.sep$var1.pred)), 
                 ceiling(max(k.de.sep$var1.pred)), by = 0.5)
stplot(k.de.sep, main="The temperature of the first 10 days in 2015", 
       sub="Separable space-time model", 
       col.regions=bpy.colors(length(plot.zlim)), at=plot.zlim) 

day1.sp = SpatialPoints(cbind(day1$lon,day1$lat))
colnames(day1.sp@coords) = c("lon","lat")
day1.sp <- STF(sp=as(day1.sp,"SpatialPoints"), time=sp.1to10.time@time)
proj4string(day1.sp)=proj4string(sp.1to10.time)

k.de.sep.day1.fit <- krigeST(temp~polym(lon, lat, degree=2, raw=TRUE),
                    data=sp.1to10.time, newdata=day1.sp, modelList=vgmf.sep) 
gridded(k.de.sep.day1.fit@sp) <- TRUE
save(k.de.sep.day1.fit,file="k.de.sep.day1.fit")
day1.fit = k.de.sep.day1.fit[,"2015-01-01"]$var1.pred
day1.res = day1.fit - day1$temp
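Given residuals defined as fitted minus observed, as day1.res is above, summary accuracy measures follow directly; a toy base-R sketch (illustrative numbers, not the actual day-1 residuals):

```r
# Toy residuals (fitted minus observed); in the report, day1.res plays this role
res  <- c(1.5, -2.0, 0.5, 3.0, -1.0)
bias <- mean(res)            # positive => systematic over-prediction
mae  <- mean(abs(res))       # mean absolute error
rmse <- sqrt(mean(res^2))    # root mean squared error
c(bias = bias, MAE = mae, RMSE = rmse)
## bias = 0.4, MAE = 1.6, RMSE = sqrt(3.3)
```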

Spatial pattern of the residuals after fitting

library(plotly)
testdata2 = data.frame(X_new, Z_new, day1.res)  # data frame needed for $hover below
colnames(testdata2) = c("lon", "lat", "elev", "temp")

testdata2$hover <- with(testdata2, paste("<br>", "Lon:", lon,  
                                       "<br>","Lat:", lat, 
                                       "<br>","Elev:", elev,
                                       "<br>", "ResTemp:", temp))
bubble2 = plot_ly(testdata2, x = ~lon, y = ~lat) %>%
  add_markers(
    text = ~hover, showlegend=FALSE,
    marker=list(color = ~temp, 
                size = ~0.3*abs(temp), opacity = 0.5, 
                type = 'scatter', mode = 'markers', showscale=FALSE),
    hoverinfo = "text") %>%
  layout(title = 'Residual from spatial-temporal model with separable covariance',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))
bubble2

The residual map suggests that this spatio-temporal model is not entirely adequate: the fitted temperatures at high latitudes are higher than the observed values, while at low latitudes the opposite holds. Perhaps the separability assumption is incorrect; in the following study we therefore explore more flexible covariance structures.